Simplified molecular input line entry specification
The simplified molecular input line entry specification (SMILES) is a specification for unambiguously describing the structure of chemical molecules using short ASCII strings. SMILES strings can be imported by most molecule editors for conversion back into two-dimensional drawings or three-dimensional models of the molecules.

Graph-based definition

In terms of a graph-based computational procedure, a SMILES string is obtained by printing the symbol nodes encountered in a depth-first tree traversal of a chemical graph. The chemical graph is first trimmed to remove hydrogen atoms, and cycles are broken to turn it into a spanning tree. Where cycles have been broken, numeric suffix labels are included to indicate the connected nodes. Parentheses are used to indicate points of branching on the tree.

Specification

A SMILES string consists of ASCII characters without spaces.

Atoms

Atoms are represented by their element's symbol in square brackets. For example, mercury is [Hg]. The elements B, C, N, O, P, S, F, Cl, Br and I (the "organic subset") can be written without brackets when the number of attached hydrogens conforms to the lowest normal valence consistent with explicit bonds. Where it can be inferred, hydrogen atoms may be omitted. For example, hydrogen chloride (HCl) is just Cl; ammonia is just N (the valence of nitrogen is 3, so three hydrogen atoms are inferred). The hydrogen atom rule only applies to the organic subset without brackets. For comparison, S refers to hydrogen sulfide (H2S, two hydrogen atoms are inferred) while [S] refers to elemental sulfur (S).

Aromatic atoms

Atoms in aromatic rings are specified in lowercase.

Example:
* n1ccccc1 - pyridine

Charges

Within brackets, any attached hydrogen atoms and formal charges must always be specified. The number of attached hydrogen atoms is shown by the symbol H followed by an optional digit. Similarly, a formal charge is shown by one of the symbols + or -, followed by an optional digit.
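The bracket notation described above (element symbol, optional H with an optional digit, optional sign with an optional digit) can be illustrated with a small parser. This is only a minimal sketch in Python, not a full SMILES parser: the names `BRACKET_ATOM` and `parse_bracket_atom` are hypothetical, and isotopes, lowercase aromatic symbols and repeated signs are deliberately left out.

```python
import re

# Hypothetical pattern for one bracket atom: "[", element symbol,
# optional "H" with an optional digit, optional sign with an optional
# digit, "]". Isotopes, aromatic symbols and repeated signs such as
# "++" are NOT covered by this sketch.
BRACKET_ATOM = re.compile(
    r"\[(?P<symbol>[A-Z][a-z]?)"
    r"(?:H(?P<hcount>\d?))?"
    r"(?P<charge>[+-]\d?)?\]"
)


def parse_bracket_atom(token):
    """Return (symbol, attached hydrogens, formal charge) for a bracket atom."""
    match = BRACKET_ATOM.fullmatch(token)
    if match is None:
        raise ValueError(f"not a supported bracket atom: {token!r}")

    hcount = match.group("hcount")
    if hcount is None:        # no "H" written: no attached hydrogens
        hydrogens = 0
    elif hcount == "":        # bare "H" means exactly one hydrogen
        hydrogens = 1
    else:
        hydrogens = int(hcount)

    charge_text = match.group("charge")
    if charge_text is None:   # no sign written: neutral atom
        charge = 0
    else:
        magnitude = int(charge_text[1:]) if len(charge_text) > 1 else 1
        charge = magnitude if charge_text[0] == "+" else -magnitude

    return match.group("symbol"), hydrogens, charge
```

For instance, `parse_bracket_atom("[NH4+]")` yields `("N", 4, 1)`, and `parse_bracket_atom("[Hg]")` yields `("Hg", 0, 0)`. A regular expression is enough here because a single bracket atom has a fixed field order; parsing whole SMILES strings (branches, ring closures) would need more than this.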
If unspecified, the charge is assumed to be zero. Multiple + or - signs are synonymous with the same sign followed by the charge; for example, [Fe++] is also [Fe+2].

Examples:
* [H+] - proton (H+)
* [NH4+] - ammonium (NH4+)
* [C-]#N - cyanide (CN-)

Bonds

Single bonds, double bonds, triple bonds, and aromatic bonds are represented by -, =, #, and :, respectively. Atoms that are adjacent in the string with no bond symbol between them are assumed to be joined by a single or aromatic bond.

Example:
* O=C=O - carbon dioxide (CO2)

Branches

Branches from any atom in the sequence may be specified using parentheses, and may be nested.

Examples:
* CC(=O)O - acetic acid (CH3COOH)
* CC(O)C - isopropyl alcohol (2-propanol)
* CC(C)C(=O)O - isobutyric acid

Cyclic structures

Cyclic "bonds" are specified by replacing the bond with a matching number written after each of the two atoms concerned.

Example:
* C1CCCCC1 - cyclohexane

This associates a string of six carbon atoms with the first atom (numbered 1) and the sixth atom (also numbered 1) bonded together. Multiple ring bonds may be assigned to a single atom. For example, C12 means that the carbon atom is assigned to bond number 1 and to bond number 2 (not bond number 12). A bond number can be reused once the second atom carrying that number has been written, which reduces the need for bond numbers of 10 or higher. When such a number is needed, a percent sign (%) must precede it. For example, C%12 is a carbon atom with bond number 12.

Disconnected structures

Disconnected compounds are separated by a dot (.).

Isotopes

Isotopes can be specified by prefixing the element symbol, within brackets, with the isotope's mass number.

Example:
* [12C] - carbon-12

Stereochemistry

Configuration around double bonds is specified using the characters "/" and "\". For example, F/C=C/F is trans-difluoroethene, while F/C=C\F is cis-difluoroethene.

References

* Daylight Theory - SMILES